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Coronaviruses are found in a diverse array of bat and bird species, which are believed to act as natural hosts. Molecular clock 
dating analyses of coronaviruses suggest that the most recent common ancestor of these viruses existed around 10,000 years ago. 
This relatively young age is in sharp contrast to the ancient evolutionary history of their putative natural hosts, which began di- 
versifying tens of millions of years ago. Here, we attempted to resolve this discrepancy by applying more realistic evolutionary 
models that have previously revealed the ancient evolutionary history of other RNA viruses. By explicitly modeling variation in 


the strength of natural selection over time and thereby improving the modeling of substitution saturation, we found that the 
time to the most recent ancestor common for all coronaviruses is likely far greater (millions of years) than the previously in- 


ferred range. 


CS (family Coronaviridae, subfamily Coronavirinae) 
are important pathogens of birds and mammals. Coronaviruses 
are positive-sense RNA viruses and are currently classified into four 
genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and 
Deltacoronavirus (1). Alphacoronaviruses and betacoronaviruses are 
found exclusively in mammals, whereas gammacoronaviruses and 
deltacoronaviruses primarily infect birds. The identification of se- 
vere acute respiratory syndrome (SARS) coronavirus in 2003 (2) 
prompted an intensive search for novel coronaviruses, resulting in 
the detection ofa number of novel coronaviruses in humans, domes- 
ticated animals, and wildlife (3—7). Interestingly, surveillance of coro- 
naviruses in wild animals has led to the discovery of the greatest di- 
versity of coronaviruses in bat and avian species, which suggests that 
these animals are the natural reservoirs of the viruses (8-10). Indeed, 
phylogenetic studies of bat and avian coronaviruses suggest an an- 
cient relationship with possible codivergence and coevolution with 
their hosts. Conversely, many coronaviruses found in bats and other 
mammals diverged near the tips of coronavirus phylogeny, suggest- 
ing that these viruses were the result of recent cross-species transmis- 
sion events (9, 11). 

Molecular clock analysis based on the RNA-dependent RNA 
polymerase (RdRp) genomic region suggests a time of most recent 
common ancestor (t(MRCA) for the four coronavirus genera of 
around 10,100 years ago, with a mean rate of 1.3 X 10 * nucleo- 
tide substitutions per site per year (10). This tMRCA estimate is 
difficult to reconcile with a hypothetical ancient, coevolutionary 
relationship between coronaviruses and their bat or bird hosts (12, 
13). Moreover, a group of genetically related, yet distinct, alph- 
acoronaviruses have been detected in different mouse-eared bats 
(Myotis spp.) on multiple continents. However, these bat species 
do not migrate long distances, with few traveling farther than hun- 
dreds of miles to overwinter sites (Gary F. McCracken, personal 
communication). Yet, the tMRCA of Alphacoronavirus is esti- 
mated to be around 200 or 4,400 years ago, on the basis of analyses 
of helicase (9) and RdRp (10), respectively. The limited interac- 
tion among these bat populations suggests a more ancient evolu- 
tionary association with alphacoronaviruses (i.e., codivergence or 
coevolution), which is incompatible with the relatively young viral 
tMRCAs. Notably, coronaviruses have a unique proofreading 
mechanism for viral RNA replication (3, 14); because of the exori- 
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bonuclease activity of viral nonstructural protein 14 (Nsp14), the 
mutation rate of coronaviruses has been found to be similar to 
that of single-stranded DNA viruses (~1 X 10 °to1 xX 10° 
mutation per site per replication) and well below those measured 
in other RNA viruses (~1 X 107° to 1 X 107° mutation per site 
per replication) (15, 16). For these reasons, we speculate that there 
is a substantial underestimation of the length of the natural evo- 
lutionary history of coronaviruses. 

Recent developments in the modeling of the evolution of RNA 
viruses have revealed that purifying selection can mask the ancient 
age of viruses that appear to have recent origins, according to a 
molecular clock (e.g., measles, Ebola, and avian influenza viruses) 
(17). Strong purifying selection can maintain evidence of se- 
quence homology long after saturation has occurred at synony- 
mous sites; this phenomenon can lead to underestimation of the 
overall depth of a viral phylogeny. In the absence of strong puri- 
fying selection, nucleotide sequences would diverge more quickly, 
lose detectable homology, and become difficult to align and com- 
pare. Here, we asked if similar evolutionary patterns led to under- 
estimation of the tMRCA of the coronavirus lineage. We em- 
ployed evolutionary models that account for variation in the 
pressure of natural selection across sites in viral loci and lineages in 
their phylogenies. Our results indicate that coronaviruses are or- 
ders of magnitude older than suggested by previous molecular 
clock analyses. 


MATERIALS AND METHODS 

Sequence data sets. Representative coronavirus genomes (n = 43; Table 
1) from the four genera were selected for this study. For several viral taxa, 
genomic sequences from multiple sampling years were included, though 
these dates were not used explicitly in our analyses. To avoid highly re- 
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TABLE 1 Coronavirus sequences analyzed in this study 


Genus and virus name Host species Strain Accession no. Sampling yr 
Alphacoronavirus 
Bat coronavirus 1A* Miniopterus magnater AFCD62 EU420138 2005 
Bat coronavirus 1B* Miniopterus pusillus AFCD307 EU420137 2006 
Bat coronavirus HKU2 Rhinolophus sinicus HKU2/GD/430/2006 EF203064 2006 
Bat coronavirus HKU8 Miniopterus pusillus AFCD77 EU420139 2005 
Feline coronavirus? Felis catus UU2 FJ938060 1993 
Feline coronavirus? Felis catus UU54 JN183883 2010 
Human coronavirus NL63 Homo sapiens NL63/DEN/2009/14 JQ765564 2009 
Human coronavirus NL63 Homo sapiens NL63/DEN/2005/1876 JQ765575 2005 
Porcine epidemic diarrhea virus Sus scrofa CH/FJND-3 JQ282909 2011 
Porcine epidemic diarrhea virus Sus scrofa CH/S JN547228 1986 
Transmissible gastroenteritis virus? Sus scrofa H16 FJ755618 1973 
Transmissible gastroenteritis virus” Sus scrofa Purdue DQ811789 1952 
Betacoronavirus 
Bat coronavirus HKUS5 Pipistrellus abramus TTO7£ EF065512 2006 
Bat coronavirus HKU9 Rousettus leschenaulti 10-1 HM211100 2006 
Bat coronavirus/133/2005 Tylonycteris pachypus BtCoV/133/2005 DQ648794 2005 
Bat SARS coronavirus‘ Rhinolophus pearsoni Rp3 DQ071615 2004 
Bovine coronavirus@ Bos taurus DB2 DQ811784 1983 
Bovine coronavirus® Bos taurus E-AH187-TC FJ938064 2000 
Civet SARS coronavirus‘ Paguma larvata SZ3 AY304486 2003 
Human coronavirus HKU1 Homo sapiens Caenl HM034837 2005 
Human coronavirus OC434 Homo sapiens HK04-02 JN129835 2004 
Human SARS coronavirus® Homo sapiens HKU-39849 isolate UOB JQ316196 2003 
Mouse hepatitis virus Mus musculus MHV-MI AB551247 1994 
Mouse hepatitis virus RA59/R13 Mus musculus RA59/R13 FJ647218 2006 
Sammbar deer coronavirus Cervus unicolor US/OH-WD388-TC FJ425188 1994 
Gammacoronavirus 
Duck coronavirus Duck’ DK/CH/HN/ZZ2004 JF705860 2004 
Infectious bronchitis virus® Gallus gallus Holte GU393336 1954 
Infectious bronchitis virus® Gallus gallus ck/CH/LHLJ/100902 JF828980 2010 
Infectious bronchitis virus® Gallus gallus Conn46 FJ904719 1991 
Infectious bronchitis virus® Gallus gallus Mass41 FJ904721 1972 
Infectious bronchitis virus® Gallus gallus Massachusetts GQ504724 1941 
Turkey coronavirus® Meleagris gallopavo IN-517 GQ427175 1994 
Turkey coronavirus® Meleagris gallopavo TX-1038 GQ427176 1998 
Turkey coronavirus® Meleagris gallopavo VA-74 GQ427173 2003 
Deltacoronavirus 
Bulbul coronavirus HKU11 Pycnonotus jocosus HKU11-934 FJ376619 2007 
Common-moorhen coronavirus HKU21 Gallinula chloropus HKU21-8295 JQ065049 2007 
Magpie robin coronavirus HKU18 Copsychus saularis HKU18-chu3 JQ065046 2007 
Munia coronavirus HKU13 Lonshura striata HKU13-3514 FJ376622 2007 
Night-heron coronavirus HKU19 Nycticorax nycticorax HKU19-6918 JQ065047 2007 
Sparrow coronavirus HKU17 Passer montanus HKU17-6124 JQ065045 2007 
Thrush coronavirus HKU12 Turdus hortulorum HKU12-600 FJ376621 2007 
White-eye coronavirus HKU16 Zosterops sp. HKU 16-6847 JQ065044 2007 
Wigeon coronavirus HKU20 Anas penelope HKU20-9243 JQ065048 2008 


* Viruses are classified as a single species (Miniopterus bat coronavirus 1). 
» Viruses are classified as a single species (Alphacoronavirus 1). 


© Viruses are classified as a single species (Severe acute respiratory syndrome-related coronavirus). 


4 Viruses are classified as a single species (Betacoronavirus 1). 
© Viruses are classified as a single species (Avian coronavirus). 
f Host species cannot be determined. 


combinant sequences and/or sequence misalignments, we specifically se- 
lected partial viral sequences from five relatively conserved genomic re- 
gions (Fig. 1 and Table 1): Nsp15-16 (1,320 nucleotides [nt]), the matrix 
protein (640 nt), papain-like protease 2 (PLP2; 620 nt), the RdRp (1,860 
nt), and the Y domain (400 nt) (18). These nucleotide sequences were 
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aligned at the amino acid level by MUSCLE (19) as described previously 
(20). 

Phylogenetic analyses. Each of the five target sequences was screened 
for recombination with a genetic algorithm tool, GARD (21). Maximum- 
likelihood phylogenies were constructed for each nonrecombinant region 
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FIG 1 Schematic diagram of the SARS coronavirus genome. Open reading frame 1a (ORFla) and ORF1b, encoding the nonstructural polyproteins, and those 
encoding the S, E, M, and N structural proteins are indicated. The approximate locations of the viral sequences used in this study are shown. 


with PHYML 3.0 (22), implemented in Seaview 4.0 (23), with a subtree- 
pruning and regrafting search algorithm and the general time-reversible 
substitution model with a four-bin gamma rate distribution (GTR + I’,). 

With these topologies, branch lengths were re-estimated in HyPhy 
(24) under GTR + [, and a branch site random effects likelihood (BS- 
REL) model (25). The BS-REL model was implemented to account for the 
effects of variable selection pressure across codon sites and phylogenetic 
lineages by inferring three unconstrained w classes (dN/dS: nonsynony- 
mous substitution rate/synonymous substitution rate) for each branch 
and estimating the proportion of sites in the alignment that evolved under 
each w class. 

To estimate the variance in branch length estimates produced by the 
BS-REL model, we used a modified Latin hypercube sampling (LHS) im- 
portance resampling scheme (M = 500 samples), described in detail pre- 
viously (26). LHS is a standard technique that can be efficiently parallel- 
ized, an important consideration here. It is used to assess the variability of 
high-dimensional distributions by discretizing the volume of parameter 
space around the maximum-likelihood estimate into N bins (10,000 in 
our case) of approximately equal probability along each coordinate and 
sampling the bin for each parameter in a way that no two parameters share 
the same bin index (this technique maximizes space coverage). Samples 
are reweighted according to their likelihood, and resamples (M = 500) are 
drawn from this distribution. A modified version of LHS (26), which we 
use here, has also been shown to compare favorably to other techniques 
for the assessment of parameter uncertainty. 


RESULTS 

Recombination detection. Phylogenies with interpretable branch 
lengths can be inferred only when analyzing nonrecombinant re- 
gions. Therefore, we screened each of the five highly conserved 
coronavirus target regions with GARD (21). Within the Nsp15-16 
region, two recombination breakpoints were detected, and phylo- 
genetic incongruity was confirmed by a Kishino-Hasegawa test 
(P < 0.01) (27, 28). The first nonrecombinant section encom- 
passed Nsp15 and 80 nt of Nsp16, which we refer to as Nsp15*; the 
second section fell entirely within Nsp16. Because of its short 
length (300 nt), the third section was not included in later analy- 
ses. Within the RdRp region, a single recombination breakpoint 
was detected with GARD, but it is unlikely that this breakpoint 
represents a true recombination event since the two topologies 
were not significantly different according to the Kishino-Hase- 
gawa test. This putative breakpoint detected in the RdRp region 
was likely due to elevated rate variation across the sequence (21). 
No recombination breakpoints were detected in the other three 
loci. 

The maximum-likelihood phylogenies inferred from the six 
nonrecombinant loci were almost all significantly different from 
each other (Shimodaira-Hasegawa test, P < 0.05) (29). This find- 
ing suggests that the different regions of the coronavirus genome 
have distinct evolutionary histories and should not be treated as a 
single region in a phylogenetic analysis. The lone exception was 
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the comparison of the topologies from RdRp and Nsp15* (P = 
0.131). Nevertheless, we chose be conservative and analyze each of 
the six loci independently. 

Branch length expansion. We employed two models of nucle- 
otide sequence evolution (GTR + T°, and BS-REL) to re-estimate 
the branch lengths of the inferred maximum-likelihood phylog- 
enies. GTR + I’, is a standard nucleotide substitution model that 
has been commonly in many evolutionary virology studies. In 
contrast, the BS-REL model is a codon model that explicitly ac- 
counts for variation in selection pressures across sites and lineages 
(25). Both evolutionary models produced comparable length es- 
timates for shorter branches (=0.05 substitution per site in Fig. 2). 
However, compared to results inferred from BS-REL, long 
branches in the coronavirus phylogenies were underestimated by 
GTR + I, (Fig. 2). Notably, there were a substantial number of 
long branches in which the number of substitutions per site ap- 
proached saturation in the BS-REL model, ranging from 8 
branches in Nsp16 to 16 branches in the Y domain (Table 2). The 
lengths of these saturated branches cannot be reliably estimated, 
although a meaningful lower bound on their lengths can be ob- 
tained (e.g., with LHS; see below). 

Among the six nonrecombinant regions, there was substantial 
variation in the total tree length expansion between GTR + I’, and 
BS-REL, ranging over 3 orders of magnitude (Table 2). Differing 
selective pressures along lineages among loci appear to account for 
this variation, as we observed a correlation between the strength of 
purifying selection along a branch (i.e., lower mean w) and the 
relative increase in branch length under BS-REL compared to 
GTR + [, (Spearman’s nonparametric correlation, P = 0.0001; 
branches in which either model inferred a length of zero substitu- 
tions per site were excluded from this analysis). Thus, lineages that 
bear the mark of stronger purifying selection generally experi- 
enced a greater increase in length under the BS-REL model. More- 
over, the number of branches approaching saturation was corre- 
lated with a greater expansion in the total length under BS-REL 
(Spearman’s nonparametric correlation, P < 0.05). Interestingly, 
simpler, site-based measurements of selection did not correlate 
with total tree length expansion (e.g., mean dN/dS [P = 0.87] and 
number of sites of pervasive purifying selection [P = 0.46]), 
though analyzing only six loci limits the statistical power of these 
tests. Nonetheless, the relationship between w and branch length 
expansion supports the hypothesis that increased purifying selec- 
tion is responsible for the underestimation of the lengths of long 
branches in RNA virus phylogenies. 

The variance in branch length expansion from GTR + I, to 
BS-REL, as approximated with LHS, differed among coronavirus 
loci (Table 2). For the Nsp15*, Nsp16, matrix protein, and PLP2 
loci, the upper and lower 95% confidence limits differed by orders 
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FIG 2 Branch length expansion under the BS-REL model compared to that under the GTR + [', model for six coronavirus sequences: Nsp15*, Nsp16, matrix 
protein, PLP2, RdRp, and the Y domain. Each point represents a branch in the coronavirus phylogeny. For ease of visualization, BS-REL branch lengths 
experiencing extreme saturation (>50 substitutions per site) are depicted with infinite length. 


of magnitude, suggesting highly imprecise estimates of branch 
length expansion in these loci. This lack of precision is not sur- 
prising, given the extent to which these loci have experienced sat- 
uration along their phylogeny. Once a site becomes saturated with 
repeated substitutions, it is very difficult for an evolutionary 
model to determine the exact number of substitutions that have 
taken place. Therefore, the maximum-likelihood estimate will be 
imprecise. Nevertheless, these results clearly suggested that there 
is substantial underestimation of the evolutionary history of coro- 
naviruses. 

To examine the effects of saturation and determine the limit of 
reliable divergence inference with BS-REL, we simulated sequence 
alignments in HyPhy (24) under the maximum-likelihood pa- 


rameter values from the BS-REL model fitted to the RdRp locus. 
HyPhy scripts and input files needed to perform the simulations 
can be downloaded at https://github.com/veg/pubs/tree/master 
/SARS. The tree inferred by using this locus experienced a dra- 
matic expansion in a previous BS-REL analysis. For example, a 
single internal branch at this locus accounted for >70% of the 
total inferred tree length, but it also experienced incredibly strong 
purifying selection (dN/dS = 0) at 97.5% of the sites and very 
strong diversifying selection (dN/dS > 100) at the remaining 2.5% 
of the sites. In the simulations (n = 100), we kept the relative 
lengths of branches fixed and scaled the entire tree length from 1 
to 10,000 expected substitutions/site. We then fitted the GTR + P, 
and BS-REL models to the replicates to evaluate the saturation 


TABLE 2 Tree lengths generated under GTR+T’, and BS-REL evolutionary models 


No. of branches 


Proportion of sites 


Tree length (no. of 


approaching under pervasive subsanonsisite) Expansion 95% CI expaneion 

saturation purifying under under BS-REL 
Viral sequence Mean dN/dS* under BS-REL? selection® GTR+IT, BS-REL BS-REL (lower-upper) 
Nsp15* 0.17 12 0.77 11.3 10,530 935 659-100,725 
Nsp16 0.12 8 0.85 10.6 6,488 610 439-157,128 
Matrix 0.19 0.67 14.8 38,287 2,581 1,868—162,126 
PLP2 0.26 13 0.70 15.3 67,831 4,432 3,261—166,920 
RdRp 0.11 14 0.87 8.0 230,399 28,860 18,699—48,256 
Y domain 0.21 16 0.85 13.3 362,318 27,253 16,504—27,434 


* Inferred by a single-likelihood ancestral counting method (50). 

» Out of a total of 83 (40 internal) branches in the coronavirus phylogeny. 
“Inferred by a fixed-effects likelihood method (50). 

4 CIs were estimated by an LHS importance resampling scheme (M = 500 samples). 
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FIG 3 Substitution saturation curves obtained with the GTR + I, and 
BS-REL models on the basis of RdRp simulations. The solid line represents 
the simulated values, symbols are plotted at median values (100 replicates 
for each tree length value), and dotted whiskers show the 2.5-to-97.5% 
range. 


behavior of the models. There were striking differences between 
the models (Fig. 3). GIR + I, consistently, and progressively 
more badly, underestimated the simulated tree length, starting at 
the lowest simulated values; this behavior is due to a long internal 
branch predominantly subject to strong purifying selection, 
which poses a challenge for the nucleotide-based model (30). Note 
that this type of saturation does not lead to infinite branch lengths 
because a majority of the sites are maintained by strong purifying 
selection. In contrast, the BS-REL model did not saturate until the 
total length of the tree was on the order of 1,000 substitutions per 
site. Interestingly, BS-REL estimates are biased slightly upward, 
largely because of the saturation of nonsynonymous substitutions 
at positively selected sites and corresponding infinite values of 
dN/dS. 

tMRCA extrapolation. The current best estimate of the tMRCA 
of all four coronavirus genera of 10,141 years ago was obtained by 
Woo et al. (10) by using the RdRp locus; their tMRCA estimate is 
the only one that includes all four coronavirus genera. Notably, 
their tMRCA estimates for alpha- and betacoronaviruses are sim- 
ilar to the dates inferred by a similar methodology from a different 
data set (31). Our results suggest that, despite the comprehensive 
nature of these studies, there are intrinsic shortcomings in the 
standard evolutionary models used to estimate branch lengths and 
the tMRCA of ancient viral lineages like coronaviruses. Neverthe- 
less, Woo et al. calibrated the coronavirus molecular clock with 
serially sampled sequences, which rely predominantly on the re- 
lationship between the branches near the tips of the phylogeny to 
estimate substitution rates. Notably, it is possible to infer consis- 
tent RNA virus substitution rates from these types of serially sam- 
pled sequences, even if the tMRCA of older lineages deep in the 
tree cannot be reliably estimated because of saturation along in- 
ternal branches (e.g., simian immunodeficiency virus/human im- 
munodeficiency virus [32, 33] and avian influenza virus [17, 34]). 

When variation in selection pressure was taken into account 
using BS-REL, we observed a dramatic expansion of the total 
length of the RdRp phylogeny: 28,860.4-fold (95% confidence in- 
terval [CI], 18,695.7-fold to 48,256.0-fold) (Table 2). If one were 
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to extrapolate the evolutionary ages estimated by Woo et al. to the 
BS-REL phylogeny, the adjusted tMuRCA of coronaviruses would 
be 293 (95% CI, 190 to 489) million years ago, 4 orders of magni- 
tude greater than that previously inferred. The robustness of this 
tMRCA estimate is highly dependent on both the rate of evolution 
estimated by Woo et al. (10), which is inferred primarily from 
short branches near the tips of the phylogeny, and the accuracy of 
the new branch lengths, which appear extremely variable. Never- 
theless, even with the conservative lower bounds of the tMRCA of 
coronaviruses and branch length expansion (974 BCE [10] and 
18,695.7-fold expansion [Table 2]), the adjusted coronavirus 
tMRCA of 55.8 million years ago would still be 3 orders of mag- 
nitude greater than the current estimate. 


DISCUSSION 


Our results indicate that the evolutionary history of coronaviruses 
likely extends much further back in time than previous estimates 
have suggested. Across the coronavirus genome, there is evidence 
that standard nucleotide models underestimate the amount of 
evolution that has occurred by orders of magnitude; strong puri- 
fying selection had masked the evidence of thousands or millions 
of years of evolution in the coronavirus phylogeny. Like many 
other DNA and RNA viruses—including herpesviruses (35, 36), 
lentiviruses (37), bornaviruses (38), filoviruses (38, 39), and 
foamy viruses (40, 41)—coronaviruses appear to be an ancient 
viral lineage. Furthermore, our results demonstrate that purifying 
selection masking an ancient evolutionary history is not a phe- 
nomenon constrained to negative-sense RNA viruses (17) but can 
be seen in positive-sense RNA viruses like coronaviruses as well. 
Interestingly, our extrapolated estimate of the tMRCA of coro- 
naviruses infecting mammalian (alphacoronaviruses and beta- 
coronaviruses) and avian (gammacoronaviruses and deltacorona- 
viruses) species of 190 to 489 (mean of 293) million years ago 
corresponds to the inferred tMRCA of these host species based on 
fossil and molecular evidence of around 325 million years ago (42, 
43). It is tempting to speculate that this correspondence between 
dates is evidence of coevolution and codivergence between bat and 
avian species and the coronavirus genera. However, given the un- 
certainty associated with branch length estimation under strong 
selection (17) and the extreme saturation leading to a dramatic 
increase in the inferred tree length, our analyses may not provide 
an accurate estimation of the tMRCA of coronaviruses. Neverthe- 
less, our results strongly suggest that the 10,000-year-ago tMRCA 
of coronaviruses is underestimated by orders of magnitude. These 
results leave open the possibility that coronaviruses have been 
infecting bats and/or birds since the origin of these clades tens of 
millions of years ago or possibly since their divergence from each 
other in the carboniferous period, over 300 million years ago. This 
extrapolation, rather than be considered a reliable estimate of the 
coronavirus tMRCA, should be viewed as a biologically plausible 
hypothesis based on realistic parameters (e.g., patterns of substi- 
tution rates and selection profiles). We can no longer reject an 
ancient coevolutionary relationship based on the molecular clock. 
The degree to which host switching and codivergence have 
shaped coronavirus diversity remains unresolved. The observa- 
tion of recent viral cross-species transmission events among 
mammalian (9, 11) and avian (8) species is clear evidence of recent 
host switching. Conversely, the separation of mammalian and 
avian coronaviruses into distinct genera is suggestive of codiver- 
gence: mammalian coronaviruses (ie., alphacoronaviruses and 
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betacoronaviruses) are generally inferred to be reciprocally 
monophyletic (8, 10, 11). Therefore, a formal analysis of host 
switching and codivergence in coronaviruses would be useful for 
disentangling which sections of the coronavirus phylogeny repre- 
sent codivergence and which represent host-switching events. 

For the shorter branches (0.05 substitution per site) in the 
coronavirus phylogenetic trees, we found general agreement in 
the inferred branch lengths between evolutionary models (GTR + 
I’, and BS-REL); there was no evidence of underestimation of the 
lengths of short branches. Therefore, for closely related viral lin- 
eages involved in recent zoonotic transmission events (e.g., SARS 
coronavirus in humans, bats, and civet cats), previous dating es- 
timates (44) are consistent with our findings. Furthermore, this 
observation suggests that substitution rate estimates inferred from 
these recent outbreaks are robust to the biasing effects of purifying 
selection with respect to branch length estimation (though coales- 
cent effects may still have an impact on these recent evolutionary 
rate estimates [45—-47]). However, it remains unclear how the 
lower-than-average mutation rate of coronaviruses (10°°to10 ° 
mutation per site per replication) (15) translates into a typical 
short-term substitution rate (10~° substitution per site per year) 
(44). Further investigation on this topic is needed. 

BS-REL is an attractive tool for future studies of ancient virus 
evolution. The use of evolutionary models that allow for variable 
selection pressure across all branches in a phylogeny when esti- 
mating branch lengths is an advance over previous implementa- 
tions by Wertheim and Kosakovsky Pond (17, 25), which necessi- 
tate a priori identification of long internal branches bearing the 
mark of strong purifying selection (30). Unlike the phylogenetic 
trees in this previous study, which were characterized by closely 
related isolates separated by long internal branches, the coronavi- 
rus phylogenies are complicated, with a mixture of long and short 
branches interspersed throughout the tree. The BS-REL frame- 
work represents a flexible and powerful approach to the modeling 
of natural selection in the evolution of viruses (25), and it does not 
necessitate designating which branches experienced stronger or 
weaker selection pressures. 

Moreover, like our previous approach to the analysis of Ebola 
and avian influenza viruses, BS-REL found evidence of branches 
experiencing saturation. Because of the way in which BS-REL pa- 
rameterizes variable selection, the saturated branches were esti- 
mated to be longer in coronaviruses than in Ebola and avian in- 
fluenza viruses. 

The BS-REL model almost certainly overfits data on short 
branches with simple patterns of natural selection, which likely 
affects the accuracy of branch length estimates across the tree. In 
the case of coronaviruses, this overfitting is not a serious problem 
because the expansion of branch length estimates due to satura- 
tion is extreme and unlikely to produce precise estimates. A more 
parsimonious implementation of BS-REL would be useful for ad- 
dressing issues in viral evolution in which more precise branch 
length estimates are needed. Appropriate modeling of variation in 
the strength of natural selection will be integral for obtaining more 
accurate tMRCAs of viral lineages. Furthermore, it is possible that 
employing more realistic evolutionary models, for example, in 
maximum-likelihood or Bayesian tree searches, could improve 
the quality of viral phylogenetic inference. 

In summary, our results indicate that coronaviruses have an 
evolutionary history much longer than those suggested by phylo- 
genetic trees inferred by using standard nucleotide evolution 
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models. This finding allows for a coevolutionary relationship be- 
tween coronaviruses and their natural hosts. It is possible that 
such a long-term relationship has allowed some animal species 
(e.g., bats) to evolve strategies to coexist with coronaviruses and 
vice versa (48, 49). Further investigation of this topic might help 
us to better understand virus-host coevolution, the origin of coro- 
naviruses, and other related viral families in the order Nidovirales. 
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